Making Conversation possible to create directly a full conversation #9434
Conversation fully created from static state.
That's really cool! Also pinging @guillaume-be here, as he is the original author of the pipeline :-)
+ remove dead enumerate + improve warning message.
history.extend(new_history)
conversation._index = i + index + 1
conversation._history = history
return history[:]
Suggested change: `return history[:]` → `return history`
Same here, and even more importantly: don't pass references to cached values back to users. They can do whatever they want with a copy, but if it's a reference they might mess up the internal state:
history = conversation._get_history()
history.append("a") # Now conversation._history also contains "a" if I'm not mistaken
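To make the point concrete, here is a minimal sketch of the copy-returning approach being discussed. The toy `Conversation` class and `_get_history` body below are stand-ins for illustration, not the actual implementation in the PR:

```python
class Conversation:
    def __init__(self):
        # Cached token history, treated as private state
        self._history = []

    def _get_history(self):
        # Return a shallow copy so callers cannot mutate the cached list
        return self._history[:]


conversation = Conversation()
history = conversation._get_history()
history.append("a")
assert conversation._history == []  # the internal cache stays untouched
```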
tests/test_pipelines_common.py
self.assertRaises(Exception, nlp, self.invalid_inputs)
class DummyTok:
This class uses torch so it should probably be inside an if is_torch_available() block.
Edit: Looks fine in the tests as long as we don't instantiate it without torch present, so feel free to ignore this comment.
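For reference, a hedged sketch of the guard being suggested; the `DummyTok` body here is invented for illustration and does not reflect the real test class:

```python
from transformers import is_torch_available

if is_torch_available():
    import torch

    class DummyTok:
        # Hypothetical stand-in body. Guarding both the torch import and the
        # class means a torch-free environment never has to touch torch at all.
        def __call__(self, text, return_tensors=None):
            return {"input_ids": torch.tensor([[0, 1, 2]])}
```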
I will ignore it for the time being. I think this might change anyway, linked to this discussion: #9432 (comment)
@require_torch
def test_integration_torch_conversation(self):
    nlp = self.get_pipeline()
Please, let's all stop naming a pipeline with a generic `nlp`. It's not the whole field of NLP, but a conversational agent, so `conversational_agent` or something in the same vein/shorter would be better here :-)
I'm all up for it! Do you mind if we keep this for a separate PR?
`nlp = pipeline(...)` is used pretty much everywhere in the tests; I don't want to break uniformity here.
I don't mind, but I'd really like this done. Let me know if you don't plan to tackle it soon and I'll make a good first issue out of it.
Let's do a good first issue. It seems to really just be a big replace, but it's always nice to have those kinds of issues for newcomers, right?
I agree with @sgugger here. Also, I think it's no problem to break "naming" unity in the tests as long as the new name is better. It's easier to do a good first issue if one has this name as a reference for how it should be done, IMO.
Perfect. I'll change the name here and create a good first issue; feel free to edit it afterwards.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Also, @Narsil do you know if it's possible to have a chat bot widget in the inference API for this pipeline? I think it would be really nice to play around with Blenderbot and DialoGPT.
@patrickvonplaten it's in the pipes, but I've not yet created the widget for huggingface.co.
@patrickvonplaten, @sgugger can you please re-review? There was sort of a major bug where we used code that bypassed the tokenizer logic. Ping @mfuntowicz to make sure we can safely remove that, or to tell us if there was a strong reason for bypassing the tokenizer logic there.
Also changed the tokenizer behavior to use a real one.
Thanks for looping me in! It looks like there are a lot of changes; a few comments on my side:
Are you sure that the behaviour remains correct for DialoGPT? As far as I know, DialoGPT uses the GPT2 tokenizer, which does not add an EOS token by default.
Thanks!
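For context, a small sketch of the tokenizer behaviour being pointed at. The checkpoint name and the manual `eos_token` concatenation are only illustrative, not the pipeline's actual code path:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")

# The GPT2 tokenizer used by DialoGPT does not append special tokens on encode
ids = tokenizer.encode("Can you recommend a book ?")
assert tokenizer.eos_token_id not in ids

# So conversation turns have to be separated with eos_token explicitly
text = (
    "Can you recommend a book ?"
    + tokenizer.eos_token
    + "I recommend reading the Lord of the Rings."
    + tokenizer.eos_token
)
input_ids = tokenizer.encode(text)
```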
Hi @guillaume-be, those changes do not belong in this PR anyway. I'll make a separate PR following this one; we should continue the discussion over there.
It seems the tests are failing.
Yes, I was missing a rebase before testing: another commit introduced a new warning, which broke the test. I am not sure what the strategy is concerning warnings and tests. I've tried to be conservative (meaning explicitly testing them), but I know it might become cumbersome at some point; I can remove those checks if needed.
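As an aside, a hedged sketch of what "explicitly testing them" can look like with the standard library; this is not necessarily how the checks in this PR are written:

```python
import unittest
import warnings


class ConversationWarningTest(unittest.TestCase):
    def test_warning_is_emitted(self):
        # assertWarns fails the test if no matching warning is raised inside the block
        with self.assertWarns(UserWarning):
            warnings.warn("Conversation input is too long, truncating.", UserWarning)


if __name__ == "__main__":
    unittest.main()
```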
What does this PR do?
The conversational pipeline currently relies on the `Conversation` object's internal state (`conversation.history` namely), and `history` is not correctly updated by the `Conversation` object.
Objectives of this PR:
- Make it possible to create a Conversation fully from static state, e.g. `Conversation("Why do you recommend it ?", past_user_inputs=["Can you recommend a book ?"], generated_responses=["I recommend reading the Lord of the Rings."])` (see the usage sketch after this list).
- Rename `history` to `_history` + `_index`, as it's now treated as a cache variable (namely to prevent recreating the tokens of the conversation all the time).
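A usage sketch of the first objective above. The checkpoint name is only an example of a conversational model; the `Conversation` call itself mirrors the one quoted in the objectives:

```python
from transformers import Conversation, pipeline

# Build a Conversation entirely from static state (the new capability in this PR)
conversation = Conversation(
    "Why do you recommend it ?",
    past_user_inputs=["Can you recommend a book ?"],
    generated_responses=["I recommend reading the Lord of the Rings."],
)

# The model name below is just an illustrative choice
conversational_agent = pipeline("conversational", model="microsoft/DialoGPT-medium")
result = conversational_agent(conversation)
print(result.generated_responses[-1])
```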
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@mfuntowicz
@sgugger
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors which may be interested in your PR.